This project is part of the Udacity’s Data Analyst Nanodegree Program. I conducted an Exploratory Data Analysis (EDA) on one of the curated data sets, provided by Udacity, from Prosper, which is America’s first marketplace lending platform, with over $12 billion in funded loans as of June 2021.
This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The dataset contains loans created between 2005-Q4 and 2014-Q1 with a last updated date of 03/11/2014. Here’s a link to the variable definitions for this dataset.
The main purpose of this project is to summarize the characteristics of variables that can affect the loan status and to get some ideas about the relationships among multiple variables using summary statistics and data visualizations.
First, we load the Prosper Loan Data. Here are the observations and variable counts.
## [1] 113937 81
Here are the object names of the dataset.
## [1] "ListingKey" "ListingNumber"
## [3] "ListingCreationDate" "CreditGrade"
## [5] "Term" "LoanStatus"
## [7] "ClosedDate" "BorrowerAPR"
## [9] "BorrowerRate" "LenderYield"
## [11] "EstimatedEffectiveYield" "EstimatedLoss"
## [13] "EstimatedReturn" "ProsperRating..numeric."
## [15] "ProsperRating..Alpha." "ProsperScore"
## [17] "ListingCategory..numeric." "BorrowerState"
## [19] "Occupation" "EmploymentStatus"
## [21] "EmploymentStatusDuration" "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup" "GroupKey"
## [25] "DateCreditPulled" "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper" "FirstRecordedCreditLine"
## [29] "CurrentCreditLines" "OpenCreditLines"
## [31] "TotalCreditLinespast7years" "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment" "InquiriesLast6Months"
## [35] "TotalInquiries" "CurrentDelinquencies"
## [37] "AmountDelinquent" "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years" "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance" "BankcardUtilization"
## [43] "AvailableBankcardCredit" "TotalTrades"
## [45] "TradesNeverDelinquent..percentage." "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio" "IncomeRange"
## [49] "IncomeVerifiable" "StatedMonthlyIncome"
## [51] "LoanKey" "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled" "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate" "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed" "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing" "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber" "LoanMonthsSinceOrigination"
## [63] "LoanNumber" "LoanOriginalAmount"
## [65] "LoanOriginationDate" "LoanOriginationQuarter"
## [67] "MemberKey" "MonthlyLoanPayment"
## [69] "LP_CustomerPayments" "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees" "LP_ServiceFees"
## [73] "LP_CollectionFees" "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss" "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded" "Recommendations"
## [79] "InvestmentFromFriendsCount" "InvestmentFromFriendsAmount"
## [81] "Investors"
Here’s the initial structure of the dataframe.
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : chr "1021339766868145413AB3B" "10273602499503308B223C1" "0EE9337825851032864889A" "0EF5356002482715299901A" ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : chr "2007-08-26 19:09:29.263000000" "2014-02-27 08:28:07.900000000" "2007-01-05 15:00:47.090000000" "2012-10-22 11:02:35.010000000" ...
## $ CreditGrade : chr "C" "" "HR" "" ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : chr "Completed" "Current" "Completed" "Current" ...
## $ ClosedDate : chr "2009-08-14 00:00:00" "" "2009-12-17 00:00:00" "" ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : chr "" "A" "" "A" ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : chr "CO" "CO" "GA" "GA" ...
## $ Occupation : chr "Other" "Professional" "Other" "Skilled Labor" ...
## $ EmploymentStatus : chr "Self-employed" "Employed" "Not available" "Employed" ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : chr "True" "False" "False" "True" ...
## $ CurrentlyInGroup : chr "True" "False" "True" "False" ...
## $ GroupKey : chr "" "" "783C3371218786870A73D20" "" ...
## $ DateCreditPulled : chr "2007-08-26 18:41:46.780000000" "2014-02-27 08:28:14" "2007-01-02 14:09:10.060000000" "2012-10-22 11:02:32" ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : chr "2001-10-11 00:00:00" "1996-03-18 00:00:00" "2002-07-27 00:00:00" "1983-02-28 00:00:00" ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : chr "$25,000-49,999" "$50,000-74,999" "Not displayed" "$25,000-49,999" ...
## $ IncomeVerifiable : chr "True" "True" "True" "True" ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : chr "E33A3400205839220442E84" "9E3B37071505919926B1D82" "6954337960046817851BCB2" "A0393664465886295619C51" ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : chr "2007-09-12 00:00:00" "2014-03-03 00:00:00" "2007-01-17 00:00:00" "2012-11-01 00:00:00" ...
## $ LoanOriginationQuarter : chr "Q3 2007" "Q1 2014" "Q1 2007" "Q4 2012" ...
## $ MemberKey : chr "1F3E3376408759268057EDA" "1D13370546739025387B2F4" "5F7033715035555618FA612" "9ADE356069835475068C6D2" ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
|| || || ||
Next, I will use a subset of this dataset for performing an analysis. The main purpose of this analysis is to explore the various factors that affect borrowers’ rates. The following are the variables I would like to use.
## [1] "LoanNumber" "LoanOriginalAmount"
## [3] "LoanOriginationDate" "LoanOriginationQuarter"
## [5] "Term" "LoanStatus"
## [7] "BorrowerRate" "LenderYield"
## [9] "ProsperRating..Alpha." "ProsperScore"
## [11] "ListingCategory..numeric." "EmploymentStatus"
## [13] "IsBorrowerHomeowner" "CreditScoreRangeLower"
## [15] "CreditScoreRangeUpper" "InquiriesLast6Months"
## [17] "PublicRecordsLast10Years" "IncomeRange"
## [19] "IncomeVerifiable"
Based on my observation of the variables in this dataset, there are data that are not in good form.
As a result, I’ll transform Term and ListingCategory columns from ‘numeric’ to ‘factor’, and transform ListingCreationDate column from ‘factor’ to ‘Date’. The following is the dataset’s structure transformed.
## 'data.frame': 113937 obs. of 19 variables:
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : chr "2007-09-12 00:00:00" "2014-03-03 00:00:00" "2007-01-17 00:00:00" "2012-11-01 00:00:00" ...
## $ LoanOriginationQuarter : chr "Q3 2007" "Q1 2014" "Q1 2007" "Q4 2012" ...
## $ Term : Factor w/ 3 levels "1 year","3 years",..: 2 2 2 2 2 3 2 2 2 2 ...
## $ LoanStatus : chr "Completed" "Current" "Completed" "Current" ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ ProsperRating : Factor w/ 8 levels "AA","A","B","C",..: NA 2 NA 2 5 3 6 4 1 1 ...
## $ ProsperScore : Factor w/ 11 levels "1","2","3","4",..: NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory : Factor w/ 21 levels "NA","Debt Consolidation",..: 1 3 1 17 3 2 2 3 8 8 ...
## $ EmploymentStatus : chr "Self-employed" "Employed" "Not available" "Employed" ...
## $ IsBorrowerHomeowner : chr "True" "False" "False" "True" ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ PublicRecordsLast10Years: int 0 1 0 0 0 0 0 1 0 0 ...
## $ IncomeRange : chr "$25,000-49,999" "$50,000-74,999" "Not displayed" "$25,000-49,999" ...
## $ IncomeVerifiable : chr "True" "True" "True" "True" ...
In this section, I will perform some preliminary exploration of the transformed dataset. I will run some summaries of the data and create univariate plots to understand the structure of the individual variables in this dataset.
First, let’s take a look on various categorical variables starting with Term.
##
## 1 year 3 years 5 years
## 0.01416572 0.77040821 0.21542607
Above 90% of the loans in this dataset are with term more than 1 year. Most of them is of 3 years.
##
## Cancelled Chargedoff Completed
## 0.0000438839 0.1052511476 0.3341671274
## Current Defaulted FinalPaymentInProgress
## 0.4965551138 0.0440418828 0.0017992399
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 0.0001404285 0.0070740848 0.0023258467
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 0.0031859712 0.0027471322 0.0026681412
There are four major loan status: “Defaulted”, “Current”, “Completed”, and “Chargedoff.”
##
## AA A B C D E HR
## 0.06330949 0.17148480 0.18362344 0.21619742 0.16822033 0.11543493 0.08172958
## NA
## 0.00000000
The above bar chart shows Prosper’s rating distribution. The unrated loans are removed in this plot.
##
## Employed Full-time Not available Not employed
## 0.019791639 0.590870393 0.231312041 0.046929443 0.007328611
## Other Part-time Retired Self-employed
## 0.033404425 0.009549137 0.006977540 0.053836769
The majority of Prosper users are either employed or have a full-time job.
About half of the users are homeowners.
##
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 0.005450380 0.063842299 0.152163037 0.282542107 0.272519024
## $75,000-99,999 Not displayed Not employed
## 0.148468013 0.067941055 0.007074085
About 70% of borrowers’ income are in the range of $25,000 - $100,000
Most of borrowers have their income verified.
Next, let’s take a look on various numerical variables starting with BorrowerRate.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0100 0.1242 0.1730 0.1827 0.2400 0.4925
The BorrowerRate has a right-skewed distribution, with a mean of 19.28%, and median of 18.40%. Similarly, the LenderYield also has a right-skewed distribution, with a mean of 18.27%, and median of 17.30%.
The ranges of credit scores are small compared to the distributions. I will use lower credit score for further analyses.
## [1] 3 0 1 7 2 5 4 11 6 9 44 NA 8 12 19 42 27 10 15
## [20] 13 21 17 38 16 14 18 22 25 28 20 23 26 29 24 40 41 30 33
## [39] 31 32 34 35 36 63 97 37 46 52 105 53 50
The distribution of InquiriesLast6Months is right-skewed and long-tailed. Outliers like 105 and 53 exist.
## [1] 0 1 2 5 NA 3 4 7 6 11 8 12 9 10 15 21 13 25 38 14 16 30 20 34 22
## [26] 17
The distribution of PublicRecordsLast10Years is right-skewed and has some outliers.
LoanOriginalAmount also displays a long-tailed distribution, with some numbers especially higher. The reason why these numbers are more frequent might be because they are the exact numbers (eg. 10,000, 15,000, 20,000) that people tend to choose.
This is a very large dataset with 113937 loan records and 81 features on each loan. For simplicity with the analysis, I only use a subset of the original dataset. The subset has same number of records (113,937), but only with 15 features that I want to explore. Those features (after transformation) are as follows:
Categorical: - Term - LoanStatus - ProsperRating - ListingCategory - EmploymentStatus - IsBorrowerHomeowner - IncomeRange - IncomeVerifiable
Numerical: - BorrowerRate - LenderYield - CreditScoreRangeLower - CreditScoreRangeUpper - InquiriesLast6Months - PublicRecordsLast10Years - LoanOriginalAmount
The main features of interest in this dataset are BorrowerRate and ProsperRating, where ProsperRating might be the overall assessments of the borrowers performed by Prosper.
The features regarding borrowers’ status (EmploymentStatus, IsBorrowerHomeowner, etc.) and those that reveals borrowers’ past payment history (PublicRecords, Inquiries, etc.)
No, I didn’t create any new variables. I only converted some variables from numeric to factor. Those variables are either intrinsically categorical or they have only several possible values.
I observed some unusal distributions in the histogram of LoanOriginalAmount. The frequencies are higher around 4,000, 5,000, 10,000, 15,000 and so on. I think that’s because these numbers are just easy to pick for most borrowers.
The dataset is tidy and I did not perform other transformation other than changing their classes (ex. numeric -> factor )
The correlation matrix shows that there are two strong positive relationships among the features. The first one is the trivial one: credit score upper range and lower range. And the second one is the borrower rate and the lender yield.
## [1] 0.9992113
The borrower rate and the lender yield have nearly perfect linear relationship. This is because Prosper charges a fixed rate on loans so the borrower rate and the lender yield should be strongly positively related.
This plot shows that Prosper ratings do affect borrowers’ rates.
Next, I want to inspect other factors that also have impact on the borrower rate and the Prosper rating. I’ll start with the borrower rate.
## $`1 year`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.0929 0.1434 0.1501 0.2064 0.2669
##
## $`3 years`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1274 0.1815 0.1935 0.2599 0.4975
##
## $`5 years`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0669 0.1490 0.1870 0.1930 0.2319 0.3304
The average rate of the 3-year’s is larger than the 1-year’s, but is very close to 5-year loan’s average. And 5-year loans have smaller range than 3-year loans.
## [1] -0.3289599
There exists a fairly weak negative relationship between the borrower rate and the original loan amount.
It seems that some categories have lower rates and others have higher rates.
Borrowers with their income verifiable or are homeowners seem to have lower rates, which makes sense.
Finally in this section, I want to what affect Prosper rating.
A higher credit score by the consumer rating agency also tends to have a higher Prosper rating.
For top ProsperRating groups like “AA” and “A”, Homeowners are the majority.
It seems very hard to get top rating when you don’t have your income verfied.
First, the lender yield and borrower rate are strongly related because the fixed rate charged by Prosper. Second, the borrower rate and Prosper rating do vary with borrowers’ status.
The relationship between list category and the borrower rate is interesting. In average, cosmetic procedure’s rates are relatively higher, whereas boat’s rates are lower.
The strongest relationship I found is the one between the borrower rate and the lender yield, with a correlation coefficient very close to 1.
## AA A B C D E HR NA NA's
## 5372 14551 15581 18345 14274 9795 6935 0 29084
Tip: Now it’s time to put everything together. Based on what you found in the bivariate plots section, create a few multivariate plots to investigate more complex interactions between variables. Make sure that the plots that you create here are justified by the plots you explored in the previous section. If you plan on creating any mathematical models, this is the section where you will do that.
Using jitter plot, we can visualize the real distribution of the borrrower rate relative to Prosper rating. In addition, I use different colors to indicate whether the individual borrower is a homeowner. This plot shows that most of the “AA” rated borrowers are also homeowners.
In this plot I found several interesting characteristics. First, each rating has its own colored area that is easily recognized, which indicates that Prosper rating is a good predictor for the borrower rate. Second, only highly rated loans (“AA”, “A”, “B”) can have original amount over $30,000, and most lowly rated loans (“E”, “HR”) are below 10,000.
## # A tibble: 6 x 3
## # Groups: ProsperRating [1]
## ProsperRating ListingCategory BorrowerRate
## <fct> <fct> <dbl>
## 1 AA NA 0.0655
## 2 AA Debt Consolidation 0.0800
## 3 AA Home Improvement 0.0789
## 4 AA Business 0.0823
## 5 AA Student Use 0.0821
## 6 AA Auto 0.0759
This plot shows the relative value of the averaged borrower rates of different list categories.
Compared to the previous plot, the boxplots give us more information about the outliers and range of the distribution. But it’s more difficult to compare mean values between different categories.
We can know from this plot that “Boat”, “Green loans”, “RV” categories have less data points. This explains why they have less outliers in the previous plots.
The heat map shows that highly rated “Not employed” borrowers have to pay slight higher rates. The jitter plots shows that even “Not employed”, “Retired”, or “Part-time” borrowers can get loan with rates lower than 0.2.
## [1] "mean"
## 1 year 3 years 5 years
## 0.1500807 0.1934855 0.1929907
## [1] "standard deviation"
## 1 year 3 years 5 years
## 0.06785817 0.07925234 0.05566590
Most loans are either 3-year or 5-year loans, and they have higher borrower rates, and the main difference between 3-year loans and 5-year loans is deivation.
Term loan is quite a good indicator whether we have a better Lender Yield or not. Also, we see how three variables Lender Yield, Prosper Rating and Debt To Income Ratio come together and how it affect each order.
The criteria for being an AA borrowers seems to be tighten over the year and a there seems to be a fixed borrower rate in criteria HR and AA.
No
Tip: You’ve done a lot of exploration and have built up an understanding of the structure of and relationships between the variables in your dataset. Here, you will select three plots from all of your previous exploration to present here as a summary of some of your most interesting findings. Make sure that you have refined your selected plots for good titling, axis labels (with units), and good aesthetic choices (e.g. color, transparency). After each plot, make sure you justify why you chose each plot by describing what it shows.
The Prosper rating is an important factor for the borrower rate. High Prosper rating is linked to lower borrower rates.
Only highly rated loans (“AA”, “A”, “B”) can have original amount over $30,000, and most lowly rated loans (“E”, “HR”) are below 10,000. Again Prosper rating is an great indicator of the borrower rate. The different colored areas can be easily recognized.
The 3-year loans have higher borrower rates than the 1-year loans in each Prosper rating. The 5-year loans are more concentrated than the 3-year loans and have smaller total range.
Tip: Here’s the final step! Reflect on the exploration you performed and the insights you found. What were some of the struggles that you went through? What went well? What was surprising? Make sure you include an insight into future work that could be done with the dataset.
Tip: Don’t forget to remove this, and the other Tip sections before saving your final work and knitting the final report!
The Prosper loan dataset contains up to 81 variables, which makes it hard to explore at the beginning. If I just randomly explore each variable in term, there would be no end and I probably would forget things along the way. To solve this problem, I decided to only explore the variables related to the borrower rates and the borrowers’ statuses. I subeted the dataset into a smaller one with only 15 variables.
In my subset dataset, there were only 7 numerical variables and my main interest was between the borrower rate and the various borrower statuses. Therefore, the majority of the visualizations I created included at least one categorical variable or more. I spent a lot of time researching how to create visualization with one, two, or more categorical variables. The solution is to use boxplots, grouped bar charts, jitter plots, and heat maps for the visualizations.
The result of this analysis shows that Prosper rating and term are two major indicators of the borrower rate. Other borrower statuses also have effects on borrower rate but their effects are either small or are also factors of Prosper rating. I think this analysis would be a good reference for anyone who wants to borrower money from a p2p lending platform.
I’d almost considering switching to another dataset as the more I’m looking at, the more I get confused and see no significant correlation between the variables. I was suspecting to see few obvious trend and correlation without much investigation but I was wrong. Although there are more than 30 plots in this report. Lots of charts were never appeared here as I did lots of back-of-the-envelope visualisation but wouldn’t make any good story. In addition to that, I have to go through each variable one by one, try to understand the meaning as I was not an expert in the financial industry.
I then categorize these variables into three different players: Prosper company, lenders & borrowers. Only then the variables makes a lot more sense & I’m trying to make assumption a long the way to explore it. At the end of the project, it did give me a sense of understanding for the lending business.
As I progress through the analysis, I only found Debt to Income ratio is the predictor of Lending Yield and Borrower Rate and the other variables seems not to play that big of a role. To expand to project, we can try to combine various variables and make a predictive model using logistic regression for a go/no-go investment decision or any classification algorigthm would make sense.